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Notwithstanding their biological importance, Y chromosomes remain poorly known in most species. A major obstacle to 
their study is the identification of Y chromosome sequences; due to its high content of repetitive DNA, in most genome 
projects, the Y chromosome sequence is fragmented into a large number of small, unmapped scaffolds. Identification of 
Y-linked genes among these fragments has yielded important insights about the origin and evolution of Y chromosomes, 
but the process is labor intensive, restricting studies to a small number of species. Apart from these fragmentary as- 
semblies, in a few mammalian species, the euchromatic sequence of the Y is essentially complete, owing to painstaking 
BAC mapping and sequencing. Here we use female short-read sequencing and k-mer comparison to identify Y-linked 
sequences in two very different genomes, Drosophila virilis and human. Using this method, essentially all D. virilis scaffolds 
were unambiguously classified as Y-linked or not Y-linked. We found 800 new scaffolds (totaling 8.5 MbpJ, and four new 
genes in the Y chromosome of D. virilis, including JYalpha, a gene involved in hybrid male sterility. Our results also strongly 
support the preponderance of gene gains over gene losses in the evolution of the Drosophila Y. In the intensively studied 
human genome, used here as a positive control, we recovered all previously known genes or gene families, plus a small 
amount (283 kb) of new, unfinished sequence. Hence, this method works in large and complex genomes and can be applied 
to any species with sex chromosomes. 

[Supplemental material is available for this article.] 



Y chromosomes play a major role in sexual reproduction by har- 
boring master sex-determination genes in many species and male 
fertility factors in most of them (Bull 1983; Carvalho et al. 2009; 
Kaiser and Bachtrog 2010; Ezaz and Graves 2012; Hughes and 
Rozen 2012). Analysis of their origin and evolution has revealed 
unexpected biological phenomena (Rozen et al. 2003; Carvalho 
and Clark 2005; Koerich et al. 2008; Lemos et al. 2008; Murtagh 
et al. 2012), as well as general principles of evolutionary genetics, 
including the role of recombination and sex-antagonistic genes 
(Rice 1996; Charlesworth and Charlesworth 2000; Zhou and 
Bachtrog 2012). However, despite their importance, little is known 
about Y chromosomes because in many species they are hetero- 
chromatic, being composed of highly repetitive DNA that cannot 
be fully assembled with current technologies (Carvalho et al. 2003; 
Hoskins et al. 2007). The same issues apply to W chromosomes 
in ZZ/ZW sex-determination systems (Bull 1983; International 
Chicken Genome Sequencing Consortium 2004). Mammalian Y 
chromosomes contain a large euchromatic portion that nonethe- 
less is also very repetitive; in a few species (human, chimp, and 
macaque), its sequence is nearly complete, owing to painstaking 
BAC mapping and sequencing (Skaletsky et al. 2003; Hughes and 
Rozen 2012). These formidable achievements demanded a huge 
investment of time and resources and placed these Y chromosomes 
apart (in all other species, only fragmentary assemblies are avail- 
able, at best). A similar effort successfully assembled the less re- 
petitive portion of the D. melanogaster heterochromatin (Hoskins 
et al. 2007). It is telling that even in the finished human genome 
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most heterochromatic regions remain unassembled (International 
Human Genome Sequencing Consortium 2004). 

Although it is not possible to fully assemble heterochromatic Y 
chromosomes, Y-linked genes can nonetheless be assembled even if 
they are deeply buried within repetitive DNA, and this partial geno- 
mic data is very informative (Carvalho et al. 2000; Carvalho and 
Clark 2005; Koerich et al. 2008; Murtagh et al. 2012). In "whole ge- 
nome shotgun" projects (WGS), which comprise the majority of 
recent genome projects, the euchromatic portion of chromosomes 
assemble into large and easily studied scaffolds, whereas hetero- 
chromatic regions are represented by thousands of small unmapped 
scaffolds (International Chicken Genome Sequencing Consortium 
2004; Hoskins et al. 2007; Levy et al. 2007). Exons of heterochromatic 
genes and other islands of unique sequence are faithfully assembled 
but appear as isolated scaffolds because the repeat-laden introns and 
intergenic regions cannot be assembled. Further assembly frag- 
mentation in the Y-chromosome is caused by its low coverage 
(compared to the autosomes) (Carvalho et al. 2003), a consequence 
of its hemizygosity. A major obstacle to the study of the Y chromo- 
some is to identify among the many unmapped scaffolds those that 
are Y-linked. This has been done by a combination of computational 
methods that suggest candidates and a PCR test to confirm Y-linkage 
(Carvalho et al. 2000; Carvalho and Clark 2005; Koerich et al. 2008; 
see Chen et al. 2012 for W-linkage). The experimental verification 
is labor intensive when applied to hundreds of scaffolds but is 
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necessary owing to the high rate of false positives of current 
computational methods. Nearly all known Drosophila Y-linked 
genes were identified using this approach (Carvalho et al. 2000; 
Carvalho and Clark 2005; Carvalho et al. 2009; Krsticevic et al. 
2010). When technically feasible, Y-linked scaffolds can be 
identified by the preparation of separate male and female DNA 
libraries before WGS sequencing, as these scaffolds would contain 
only male reads (Krzywinski et al. 2004). This approach is not 
possible for the majority of the available genome sequences because 
they employed mixed-sex libraries (also, in mammals, sequencing 
of a single homogametic female is common practice). 

Here we show that Y chromosome sequences can be identified 
with a simple, efficient, and inexpensive method (Y chromosome 
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Figure 1. Outline of the YGS (V chromosome Genome 5can) method. 
Y-linked sequences can be efficiently identified by a comparison of the 
assembled genome with inexpensive short-reads obtained from female 
DNA: The Y-linked sequences should get no match, whereas autosomal 
and X-linked sequences should be nearly completely matched. Efficient 
removal of all types of repetitive sequences is critical because they are 
shared between the Y chromosome and the female DNA, and was ac- 
complished by a straight comparison of the short DNA words (k-mers) 
present in the assembled genome and female short-reads. We successfully 
applied the YGS method to two very different genomes, D. virilis and 
human. 



Genome Scan or YGS) (Fig. 1) suitable for all genome projects that 
include the heterogametic sex, and apply it to Drosophila and humans. 

Results 

Identification of Y-linked scaffolds in Drosophila virilis 

The Drosophila virilis genome was sequenced in 2007 using un- 
sexed embryos (Drosophila 12 Genomes Consortium 2007). A pre- 
vious search of the orthologs of the D. melanogaster Y-linked genes 
identified six genes on the D. virilis Y chromosome (Koerich et al. 
2008). Here, we apply a single lane of Illumina short-read se- 
quencing to adult female DNA (—22 Gbp; ~85-fold coverage of the 
genome; current cost US —$2000). When comparing these female 
reads with the available genome, the Y-linked scaffolds should get 
no match if all types of repetitive sequences are masked. This 
comparison can in principle be implemented using standard pro- 
grams such as RepeatMasker and BLAST, but they do not allow the 
fine-tuning needed for this particular purpose. Using a computer 
program tailored for this application ("YGS.pl") (Carvalho and 
Clark 2008; Methods), we first built a list (implemented as a bit- 
array) of all overlapping 15-bp sequences ("15-mers") present in 
the female reads. Then we built a list of all 15-mers present in only 
one copy in the assembled D. virilis genome. This list of single-copy 
15-mers efficiently removes identical repeats that would escape 
usual repeat masking programs (e.g., from gene duplications), 
while preserving the useful information provided by single-copy 
variants of transposable elements, segmental duplications, and 
other repeats (Krsticevic et al. 2010; Dennis et al. 2012; Hughes and 
Rozen 2012). Finally, we compared the two lists and obtained for 
each scaffold the proportion of unmatched single-copy /c-mers; to 
increase resolution we optionally removed fc-mers that are likely to 
be sequencing errors (Methods). As shown in Figures 2A and 3 A, 
the result is clear-cut: The distribution is sharply bimodal; the right 
peak (centered at 95%-100% unmatched fc-mers) corresponds to 
the Y chromosome, whereas the left peak (centered at 0%-5% 
unmatched fc-mers) corresponds to the X and autosomes. The few 
previously known Y-linked scaffolds of D. virilis, and —800 new 
ones (totaling 8.5 Mbp), were cleanly identified (Figs. 2A, 3A). We 
tested by PCR a sample of 15 candidate scaffolds from the right 
peak, and all were Y-linked (Supplemental Table SI). Additional 
validation of the method is provided by the known D. virilis 
Y-linked genes: All 52 scaffolds that encode them are located in 
the right peak. Fifteen scaffolds (out of 1 186) produced ambigu- 
ous results, containing above —25% and below —80% unmatched 
k-mets. These intermediate scaffolds are small (average size of 
2.7 kb and range of 1.1-5.8 kb), very repetitive (with many BLASTN 
hits in the D. virilis genome), and amount to 40 kb (0.02% of the 
genome size). They probably originate from misassembled re- 
peats. At any rate, essentially all D. virilis genome sequences can be 
safely classified as Y-linked or not Y-linked without experimental 
verification. 

We searched for protein coding genes among the newly 
identified Y-scaffolds, and we found four new functional genes, 
11 pseudogenes, and —10 scrambled copies of the mitochondrial 
genome. These mitochondrial copies are very similar and show 
signs of misassembly, so further studies (e.g., Krsticevic et al. 2010) 
are needed to clarify whether or not any mitochondria-derived 
gene is functional. We focus here on the four functional genes. 
Gf 19835 belongs to the M20 dipeptidase gene family; phyloge- 
netic and synteny analyses strongly suggest that its ortholog was 
lost in the melanogaster group of species, being present in all 
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Figure 2. Identification of Y-linked scaffolds in the D. virilis (A) and 
human (B) genomes. The peak on the right corresponds to Y-linked scaf- 
folds, and the peak on the left corresponds to autosomal and X-linked 
scaffolds. (Abscissa) Proportion of scaffold sequence not matched by female 
short reads (in percentage unmatched single-copy /c-mers); (ordinate) 
number of scaffolds. (A) D. virilis genome, CA assembly (1186 scaffolds, 
1 65 Mbp of sequence excluding gaps; see Methods). There are 807 Y-linked 
scaffolds (total sequence: 8.5 Mbp); only a few were known before. 
(B) Human genome, HuRef assembly (Levy et al. 2007; 4606 scaffolds, 
2796 Mbp of sequence excluding gaps). There are 1 19 Y-linked scaffolds 
(total sequence: 18.4 Mbp); as expected, most of their sequence was al- 
ready known to belong to the Y chromosome (Skaletsky et al. 2003). 



other sequenced Drosophila; its D. pseudoobscura ortholog GA25180 
(which is autosomal) is expressed only in males (Supplemental 
Fig. SI). The three other genes, G] 19633, G] 11 126, and G) 18574, 
have autosomal orthologs in D. melanogaster (CGI 1 71 9, CG2964, 
and JYalpha), which are expressed only in males and encode, re- 
spectively, a conserved sperm tail structural protein, a male-specific 
pyruvate kinase present in the sperm proteome, and a sperm- 
specific subunit of the Na,K-ATPase (Supplemental Figs. S2-S4). 
Hence, these four genes fit the general pattern of Drosophila Y-linked 
genes, formerly autosomal male-specific genes transposed to the 
Y-chromosome (Carvalho et al. 2009). In order to determine when 
they were acquired by the Y-chromosome, we investigated other 
Drosophila species using PCR and synteny analysis as described in 
Koerichetal. (2008). Three genes (GJ19835, GJ19633, andGflll26) 
moved to the Y-chromosome after the split between D. melanogaster 
and D. virilis (Fig. 4; Supplemental Figs. S1-S4; Supplemental 
Table S2). In contrast, we found that GJ18574/JYalpha belong to 
the ancestral Drosophila Y-chromosome; its autosomal location in 
D. melanogaster resulted from a Y-to-autosome movement in the 
ancestor of the melanogaster subgroup of species. 

Identification of Y-linked scaffolds in the human genome 

The YGS method worked very well in D. virilis and other Drosophila 
species (not shown), but these genomes are fairly small and not 
very rich in repetitive DNA. Furthermore, they were sequenced 
using inbred strains, whereas in outbred genome sequences 



(e.g., humans and many other organisms) males and females 
will differ not only in the sex chromosomes but also at poly- 
morphic sites. We found that the YGS method cleanly identify the 
Y chromosome in the reference human genome (Fig. 5). As a more 
stringent and realistic test we applied it to a WGS assembly. Since 
the human Y has been intensively studied (Skaletsky et al. 2003; 
Hughes and Rozen 2012), the main point here is to test whether 
previously known Y sequences could be identified. We down- 
loaded the WGS assembly "HuRef," which came from a male with 
Great Britain ancestry (Levy et al. 2007), and the Illumina short 
reads from 36 Great Britain women, produced by the 1000 Ge- 
nomes Project (The 1000 Genomes Project Consortium 2010). We 
then applied the same general procedures described above for D. 
virilis, except that we increased k-met size from 15 to 18 to allow for 
the larger genome size (Methods). Again the distribution of 
matching is bimodal, with Y-linked scaffolds centered around 95% 
unmatched single-copy fc-mers, and non-Y centered below 5% 
(Figs. 2B; 3B). The Y chromosome amounts to —2% of the total 
genome sequence in humans (International Human Genome Se- 
quencing Consortium 2004) and —19% in Drosophila (Carvalho 
et al. 2009), but its impressive appearance in Figure 2A when 
compared to Figure 2B basically reflects a more fragmented as- 
sembly in Drosophila. More relevant differences are the promi- 
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Figure 3. Identification of Y-linked scaffolds in the D. virilis (A) and 
human (B) genomes (scatter plot). The figure complements Figure 2A,B, 
adding the scaffold size and mapping. Each dot represents one scaffold in 
the assembled genomes; red dots are scaffolds previously known to be 
Y-linked, and blue dots are unmapped or not Y-linked scaffolds. (Abscissa) 
proportion of scaffold sequence not matched by female short reads (in 
percentage unmatched single-copy /c-mers); (ordinate) scaffold size. (A) 
D. virilis genome (CA assembly).There are 15 intermediate scaffolds (de- 
fined as those having between 25% and 80% unmatched single-copy 
/c-mers); they are all small (total size: 40.4 kb ; 0.02% of the genome), and 
they seem to originate from misassembled repeats (see Results). (B) Hu- 
man genome (HuRef assembly). There are 51 intermediate scaffolds; all 
are Y-linked and many of them could be traced to misassembled seg- 
mental duplications involving the Y chromosome, such as the two marked 
with arrows (accessions DS4861 71 and DS486351; see Results). 
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Figure 4. Gene gains and losses in the DrosophilaY. Gene location (Y-linkage vs. autosomal/X-link- 
age) was determined by PCR. Direction of the movements (gene gains, red arrows; gene losses, blue 
arrows) was inferred by synteny and parsimony (Supplemental Figs. SI -S4). For the six ancestral genes 
(dashed arrows), there is no close outgroup for inferring the direction (gain versus loss) (Koerich et al. 
2008). Genes were labeled with the names of the D. melanogaster (JYalpha, CGI 7 7/9, CC2964) or D. 
virilis (GJ1 9835) orthologs. Data for genes jYalpha (abridged as "jYa"), GI19835, CG11719, and CG2964 
came from the present study, and the remaining genes from Koerich et al. (2008). 



nence and origin of the "intermediate scaffolds" in the human 
genome. We classified 51 scaffolds as intermediate (i.e., with am- 
biguous position in Fig. 3B), and three of them are quite large 
(DS486460, 269 kb; DS486351, 885 kb; DS486171, 3.9 Mbp). Also 
differently from Drosophila, many of these intermediate scaffolds 
do not have tens or hundreds of BLASTN hits in the genome but 
instead have one or a few hits. As detailed in the next section, we 
found that all intermediate scaffolds are Y-linked, and that many of 
them were caused by misassembled segmental duplications in- 
volving the Y chromosome. Note, however, that segmental du- 
plications are perfectly classified (as Y- or not Y-linked) by the YGS 
method if they were correctly assembled (Fig. 5). 

On the whole, we found 119 Y-linked HuRef scaffolds 
(18.4 Mbp), which contain all known single-copy human Y-linked 
genes and at least one representative of the multicopy ones (be- 
low). A sequence comparison is needed to distinguish among the 
119 scaffolds those scaffolds that are covered by the reference Y 
from those containing new sequences. We searched potentially 
new Y-linked sequences using two conservative and complemen- 
tary approaches (Methods). We found that most of the scaffolds 
(85 of 119) and the vast majority of the sequence (18.165 Mbp of 
18.371 Mbp) are entirely contained (or nearly so) within the ref- 
erence Y sequence, whereas 34 small scaffolds (totaling 206 kb) are 
primarily composed of new sequence (Supplemental Fig. S5; Sup- 
plemental Table S4, column 5). These 34 scaffolds contain mostly 
repeats (76% of the sequence; RepeatMasker search), including 
known types of Y satellite sequences (Supplemental Table S4); six 
of them (DS487348, DS487367, DS487401, DS488597, DS489331, 
and DS490176) match the few known sequences (accessions 
AY598345-AY598347, AC068123) of the large heterochromatic 
Yql2 band, which comprises 40 Mbp and has not been sequenced 
(Skaletsky et al. 2003; Jehan et al. 2007; Hughes and Rozen 2012). 



The Yql2 band (and the Y centromere) are 
the likely source of many of these 34 
scaffolds; and hence, they may provide 
entry points for the heterochromatic re- 
gions of the human Y. Within the 85 
scaffolds that closely match the refer- 
ence Y, we found five small regions of 
new sequence (in scaffolds DS486512, 
DS486519, DS486628, and DS486277; total 
size: —76 kb) (Supplemental Table S3). Four 
regions are located in or very near the end of 
both the HuRef and reference Y scaffolds; 
dot plots show a good alignment of both 
assemblies until the end of the reference 
scaffold, followed by the new sequence 
present only in the HuRef scaffold (Supple- 
mental Fig. S6). Hence, these new sequences 
likely partially fill gaps in the reference Y 
sequence. The fifth new region (scaffold 
DS486227, approximate position 1628481- 
1637920) is an 11 -kb insertion in the mid- 
dle of the reference scaffold NT011875 
(Supplemental Table S3). We later found 
that this insertion is a polymorphism cov- 
ered by high-quality sequence (fosmid 
AC236424, obtained as part of the Human 
Genome Structural Variation Project [Kidd 
et al. 2008] and BAC clone AC245170). 
The rediscovery of this region is another 
proof of the method's power. As com- 
monly seen in human Y chromosome euchromatic sequences 
(Skaletsky et al. 2003), these five regions contain repeats and du- 
plications from other chromosomes (Supplemental Table S3). All 
BLASTX hits came from incomplete copies of genes from other 
chromosomes, frequently with in-frame stop codons, so they are 
clearly pseudogenes. A BLASTN search against the NCBI human EST 
database does not show any transcript that came from these regions 
(not shown). In short, there is no sign of functional genes. These 
newly found euchromatic sequences may be useful targets for fin- 
ishing. On the whole, we found 283 kb of new sequence (—1% of 
the reference Y). Using the 1000 Genomes Project reads, we con- 
firmed that they are present in other males and absent in females; 
hence, they are newly found pieces of the human Y instead of artifacts 
or rare polymorphisms in the HuRef assembly (Fig. 6; Methods). 



Intermediate scaffolds in the human data and segmental 
duplications 

In contrast with Drosophila, in the human genome some scaffolds 
that produce ambiguous results are quite large and do not seem to 
result from trivial causes such as misassembled highly repeated 
sequences. While investigating their origin, we considered four 
hypotheses: (1) Binomial sampling error; (2) autosomal or X-linked 
scaffolds displaced toward the Y-peak (i.e., the low-match region in 
the right side of Fig. 3B) due to sequencing errors or rare poly- 
morphisms in the HuRef assembly; (3) chimeric scaffolds involving 
Y-linked and non-Y-linked sequences; and (4) Y-linked scaffolds 
displaced toward the autosomal/X-peak (i.e., the high-match region 
in the left side of Fig. 3B) due to incomplete detection of repetitive 
fc-mers. Binomial sampling error is quantitatively irrelevant: Due 
to the high depth of female reads (38x), the proportion of un- 
matched fc-mers expected by chance in an autosomal scaffold is 
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Figure 5. Test of the YGS method in the reference human genome. 
Each dot represents a human chromosome in the reference assembly 
(International Human Genome Sequencing Consortium 2004). (Blue 
dots) Autosomes and the X; (red dot) Y chromosome. The stars are the 
X-linked (blue star) and Y-linked (red star) copies of a 3.9-Mbp segmental 
duplication ("XTR"); the two copies are 98.8% identical (Ross et al. 2005). 
The region is continuous in the X chromosome (accession NT_01 1 651 .17; 
coordinates 11754754-15671817) and interrupted by a large insertion 
in the Y (Hughes and Rozen 201 2); we used only the homologous region 
of the Y-linked copy (accession NT_01 1896.9; coordinates 268439- 
3453096 and 3751 260-3967080). Note that the two copies of this seg- 
mental duplication are correctly identified as Y and not Y-linked, despite 
their very high sequence identity. (Abscissa) Proportion of scaffold se- 
quence not matched by female short reads (in percentage unmatched 
single-copy /c-mers); (ordinate) chromosome size. 



very small (approximately 10~ 38 , assuming a Poisson distribution, 
but this ignores polymorphism and sequencing bias). Regarding 
sequencing errors/rare polymorphisms, their effect was virtually 
eliminated by the use of male short reads to validate /c-mers (see 
Methods, Removal of Sequencing Errors in the Assembled Genomes 
section). Regarding chimeric scaffolds, we have not found any ex- 
ample in Sanger-based genome projects (such as HuRef), but it re- 
mains a possibility (see below, Application of YGS to Genomes As- 
sembled from Short Reads section). The last possible cause of 
intermediate scaffolds, Y-linked scaffolds displaced toward the 
autosomal/X-peak due to incomplete detection of repetitive /c-mers, 
proved to be relevant for the human genome. Specifically, we could 
trace a number of intermediate scaffolds to misassembled (col- 
lapsed) segmental duplications in the HuRef assembly. 

Segmental duplications (SD) are regions larger than 1 kb that 
are not transposable element copies and have sequence identity 
>90% (International Human Genome Sequencing Consortium 
2004). They are common in the human genome and create major 
problems because they frequently are collapsed, particularly in 
unfinished WGS assemblies (Bailey et al. 2001; Alkan et al. 2011), 
but also in the finished human genome (Dennis et al. 2012). SD 
would be perfectly classified by our procedures if they were cor- 
rectly assembled (Fig. 5). On the other hand, collapsed segmental 
duplications will cause repetitive /c-mers to be classified as single- 
copy; if one SD copy is located in the Y chromosome and the 
other is X-linked or autosomal, the corresponding scaffold will 
get many matches from female reads in "single-copy" /c-mers and 
will be displaced toward the autosomal/X peak, the exact amount 
of displacement depending on how much non-SD sequence it 
contains. Note also that the assembled sequence probably will 
contain patches of the different copies of the SD. 



If the above hypothesis for the origin of intermediate scaf- 
folds is correct, then they should be Y-linked. Indeed, as shown in 
the previous section, alignment with male and female short reads 
showed that all intermediate scaffolds are Y-linked. Furthermore, 
when we examined a sample of intermediate scaffolds, we found 
several cases of collapsed segmental duplications involving the 
Y chromosome. For example, a BLASTN search of scaffold DS487 199 
(53% unmatched) under high stringency (word size of 128) de- 
tects only one full copy in the HuRef assembly (i.e., itself), but 
four full copies in the reference human assembly: the Y-linked 
NT_1 13819 (with 100% identity), and three autosomal (NT_1 13888, 
NT_113889, and NT_113958; all above 97% identity). Analogous 
results were obtained for DS488767 (59% unmatched) and other 
scaffolds. By far the most serious case of misassembled SD we 
found in the HuRef assembly was the X-transposed region ("XTR"), 
a 3.9-Mbp block that transposed from the X to the Y ~5 million 
years (Myr) ago; the X and Y copies have 98.8% identity (Page et al. 
1984; Ross et al. 2005). As acknowledged by Levy et al. (2007), the X 
and Y copies of this region are collapsed in the HuRef assembly; we 
found that most of the XTR sequence ended up in two large Y-linked 
scaffolds plus 26 small scaffolds (BLASTN search; not shown). Seven 
of the 51 intermediate scaffolds, including the two largest 
(DS486171, 3.9 Mbp, 14% unmatched /c-mers; DS486351, 957 kb 
75%) trace to this misassembled segmental duplication. When 
we added the missing region of the X chromosome to the 
HuRef assembly (e.g., accession NT_011651.17, between positions 
11946708-1554301) and repeat the whole analysis with the YGS 
method, the DS486171 and DS486351 scaffolds got the typical 
low matching of Y-chromosome sequences (85% and 93% un- 
matched; as expected, the number of single-copy /c-mers decreased, 
e.g., from -1,750,000 to 295,000 in DS486171). This clearly illus- 
trates the effect of collapsed segmental duplications on the de- 
tection of Y-linked scaffolds. It is worth noting that misassembled 
SD are not the sole source of intermediate scaffolds; all six scaf- 
folds that came from the heterochromatic band Yql2 are inter- 
mediate (Supplemental Table S4). These six scaffolds came from 
highly repetitive regions (Jehan et al. 2007; see also Supplemental 
Table S4, column 6), so as in Drosophila, these highly repetitive 
regions can also give ambiguous results, possibly because they 
have unassembled copies in other chromosomes. 

Coverage of the reference Y chromosome sequence by the 119 
HuRef Y-linked scaffolds 

It is important to know what proportion of the Y-linked genes (and 
other important sequences) we can expect to detect by the appli- 
cation of the YGS method to a draft assembly. Note that this pro- 
portion reflects both the quality of the WGS assembly (Alkan et al. 
201 1) and the efficiency of the YGS method. In order to answer this 
question for the HuRef assembly, we did a BLASTN search of the 
coding sequences of all known human Y-linked genes (of the male- 
specific region), against the 119 Y-linked HuRef scaffolds detected 
by the YGS method (we took one representative in the case of 
multicopy genes). All 28 genes (Skaletsky et al. 2003; Hughes and 
Rozen 2012) are present among the 119 scaffolds, 20 of them with 
an essentially complete coding sequence. The ZFY gene, the worst 
case, was missing 20% of its sequence. 

Application of YCS to genomes assembled from short reads 

Nearly all recent genome projects employed short reads (Illumina, 
454, Solid) instead of the older Sanger reads, which were used in 
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the genome assemblies analyzed above. This makes it important to 
know how the YGS method would perform on short-read genome 
assemblies. The higher coverage allowed by the short reads can 
produce an assembly with fewer gaps in single-copy regions, which 
is particularly useful for assembling Y-linked scaffolds (due to their 
inherently lower coverage when compared to the autosomes), but 
the shorter reads increase fragmentation and possibly the chance 
of chimerism and other misassemblies. For example, short-read 
assemblies are more prone to misassemble segmental duplications 
(Alkan et al. 2011). We did a preliminary analysis of the D. kikkawai 
genome, which was assembled using only short reads (S Richards, 
J Qu, H Jiang, M Batterton, K Blankenburg, S Gubbala, Y Han, 
J Jayaseelan, D Kalra, C Kovar, et al., unpubl.; accession number 
AFFH00000000.2), and we found the typical bimodal distribution of 
matches (Supplemental Fig. S7). The four previously known Y-linked 
genes of D. kikkawai are contained in scaffolds with a very high 
proportion of unmatched fc-mers, except for a piece of the kl-2 gene 
(Supplemental Table S5). This gene was split into three scaffolds, 
two with the typical match rate of the Y-chromosome, and one 
intermediate (KB459848; 15.7% unmatched fc-mers). The latter 
seems to contain a misassembly, being composed by one contig 
that contains the kl-2 gene and has the typical signature of 
Y-linkage (AFFH02007563; 99.8% unmatched fc-mers); a second 
Y-linked contig (AFFH02007564; 83% unmatched); and 19 other 
contigs that behave as autosomal or X-linked (AFFH02007565- 
AFFH02007583; average: 0.6% unmatched). This pattern suggests that 
KB459848 is chimeric, or that it is part of a misassembled segmental 
duplication (such as the DS486171 scaffold of the human assembly) 
(Supplemental Fig. S5); settling this point would require a finished 
genome sequence. Given the increased risk of chimerism and mis- 
assembly of segmental duplications, it may be safer to run YGS with 
both scaffolds and contigs when dealing with short-read assemblies. 
Despite these limitations, YGS will be a useful tool for these as- 
semblies as well, since all known Y-linked genes of D. kikkawai 
would have been detected. 

Discussion 

The study of Y chromosomes has been hampered by their frag- 
mentation after genome sequencing and by the difficulty of iden- 
tifying their pieces among the many unmapped scaffolds. Previous 
studies used methods such as "staggered TBLASTN hits" (Carvalho 
et al. 2000; 2001), "low-shotgun depth" (Carvalho et al. 2003; 
Carvalho and Clark 2005), and "testis ESTs x armU" (Vibranovski 
et al. 2008), which had a false positive rate of 40%-65% and an 
unknown rate of false negatives (none of them was carried ex- 
haustively). This high false-positive rate made experimental con- 
firmation of Y-linkage obligatory and time consuming. In compar- 
ison, the YGS method classified nearly all scaffolds of the D. virilis 
genome assembly as Y-linked or not Y-linked, without false positives 
and false negatives, making the laborious and time-consuming ex- 
perimental confirmation of Y-linkage much simpler (we did it only 
with the few scaffolds that contain functional genes, and all proved 
to be Y-linked). As a cautionary note, we should keep in mind that 
the Y chromosome differs greatly among species, so more experi- 
mental validation may be necessary in some genomes or assemblies. 
In the human genome, some ambiguity was introduced by mis- 
assembled segmental duplication, which caused some Y-linked 
scaffolds to be located between the Y and autosomal/X peaks of 
the distribution (Figs. 2B, 3B). Hence, particularly when dealing 
with these complex genomes, scaffolds that are suspiciously dis- 
tant from the "autosomal/X peak" of the distribution should be 



considered as probably Y-linked, even if they are distant from the 
"Y peak" (e.g., scaffold DS486171) (Fig. 3B). Another assembly- 
related problem is that draft genomes, where the YGS method will 
be most useful, will have sequence gaps, which obviously will limit 
the identification of Y-linked sequences. Despite these problems, if 
the assembly we examined (HuRef) belonged to some poorly 
known mammalian genome, all its single-copy Y-linked genes and 
at least one representative of each multicopy gene would have 
been discovered. In sum, the YGS method worked nearly perfectly 
with a Drosophila genome, and worked well in a complex genome. 
It also worked well in a preliminary experiment with the D. kikkawai 
genome, which was assembled from short reads. Hence, it seems 
that the problem of identification of Y-linked scaffolds is solved. As 
discussed below, application of this method has yielded interesting 
biological insights for both the D. virilis and the human genome. 
Since it is efficient, simple, and inexpensive, it seems likely that 
more will come in the near future. 

JYalpha, a gene involved with reproductive isolation, is part 
of the ancestral Drosophila Y chromosome 

G] 18574, the ortholog of JYalpha, is one of the four new genes we 
found in the D. virilis Y chromosome. We found that it belonged to 
the ancestral Drosophila Y chromosome, and it moved to an auto- 
some (the chromosome 4 of D. melanogaster) in the ancestor of the 
melanogaster subgroup (Fig. 4). Masly et al. (2006) found a second, 
more recent JYalpha movement: It transposed from chromosome 4 
to 3R in the ancestor of D. simulans (Supplemental Fig. S8). As a re- 
sult, certain D. melanogaster ID. simulans hybrids lack the gene and 
are sterile. The finding of a gene involved in reproductive isolation 
in the ancestral Drosophila Y chromosome suggests a simple expla- 
nation for the key role of this chromosome in several cases of re- 
productive isolation between Drosophila species (e.g., Pantazidis 
et al. 1993; Sweigart 2010), especially because sex-chromosomes are 
much more prone than autosomes to cause imbalances in hybrids 
(Long et al. 2012). It will be interesting to verify whether gene 
movements to/from the Y chromosome occurred in these species. 

Gene gains outnumber gene losses by 11-fold in Drosophila 

Current theory on Y chromosome evolution states that X and 
Y chromosomes evolved from an ordinary pair of autosomes after 
one of them acquired a male-determining function and became 
a proto-Y. Massive gene loss on the proto-Y would follow, gen- 
erating a typical Y chromosome in which the few remaining 
genes are mostly shared with the X (Bull 1983; Rice 1996; 
Charlesworth and Charlesworth 2000). Although this theory is 
very well-supported in mammals and other groups (e.g., Skaletsky 
et al. 2003), the evidence in Drosophila does not support it (Carvalho 
et al. 2009). For example, a previous study with ten Drosophila 
species (Koerich et al. 2008) estimated that the rate of gene gain is 
10.7-fold higher than the rate of gene loss (95% confidence in- 
terval: 2.2-51.3). At that time, the Y-linked gene content was well- 
known only in D. melanogaster; therefore, gene gains could only be 
measured in this lineage and gene losses only in the other species. 
Hence, the 10.7 estimate requires the assumption that the rates of 
gene gain and gene loss are homogeneous across lineages (Koerich 
et al. 2008) and may be incorrect if this assumption is wrong. 
Given the D. virilis data reported here, we can now directly measure 
gene gain and gene loss rates in the same lineages, without the 
above assumption. Namely, there were four gene gains and zero 
gene losses across 63 Myr in the D. virilis lineage, and the corre- 
sponding values in the D. melanogaster lineage are seven gene gains 
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Table 1 . Gene gains and losses in the Drosophila Y chromosome 



Method of estimation 


Reference species 


Gene gains 


Gene losses 


Unbiased gain/ ratio (95% CI) 


P" 


Assumption free b 

Homogeneous gain loss 0 

Homogeneous gain loss 
Homogeneous gain loss 


D. virilis + 

D. melanogaster 

D. virilis + 

D. melanogaster 

D. virilis 

D. melanogaster 


1 1 genes/1 26 Myr 

1 1 genes/126 Myr 

4 genes/63 Myr 
7 genes/63 Myr 


1 genes/1 26 Myr 
3 genes/338 Myr 

3 genes/275 Myr 

2 genes/275 Myr 


11.0 (1.4-85.2) 

7.5 (2.1-26.7) 

4.9 (1.1-22.0) 
10.7(2.2-51.3) 


0.022 

0.002 

0.037 
0.003 



Significance level under the null hypothesis of equal gain and loss rates. 

b Gene gains and gene losses measured in the D. melanogaster and D. virilis branches only (the other species were ignored; Supplemental Material). 
c Gene gains measured in the branches of the reference species, gene losses in the remaining sequenced species, with bias correction as in Koerich et al. 
(2008). This procedure assumes homogeneous gain and loss rates across the lineages. 



and one gene loss across 63 Myr (Fig. 4). These values imply a gene 
gain/gene loss ratio of 11.0 (P = 0.022, Poisson regression) (Table 1; 
Supplemental Methods). Similar results were obtained by applying 
the method described in Koerich et al. (2008) to the combined 
D. virilis and D. melanogaster data or to each species separately 
(Table 1) and by approximate Bayesian computation (Supple- 
mental Fig. S9). Hence, independent data from two species, ana- 
lyzed by different methods, support the conclusion that the gene 
content of the Drosophila Y indeed increased, at least in the last 
—63 Myr. This suggests that the canonical model of Y chromosome 
evolution, based on gene degeneration and loss, is not valid for all 
organisms (Carvalho et al. 2009). 

Completeness of the euchromatic sequence of the human Y 

The total amount of new sequence we found in the human 

Y chromosome is small (283 kb), and no new large scaffold was 
found, in accordance with the conclusion that the reference 

Y chromosome sequence is essentially complete. However, it may 
be argued that the methods we used are overly conservative or that 
the HuFcef assembly is too incomplete; and if this is true, then our 
data would say little about the completeness of the reference Y. In 
order to evaluate this possibility, we repeated the k-mer compari- 
son procedure (see Methods, Detection and Validation of New 
Human Y-linked Sequences section), removing from the reference 

Y one of its most elusive regions, a 450-kb euchromatic island 
buried within the pericentromeric repeats and found only in 2005 
(coordinates 13193955-13748578 of NC_000024.9) (Kirsch et al. 
2005). This mimics a missing sequence in the reference Y and was 
indeed its actual situation before 2005. We found that the missing 
sequence would have been immediately detected; a large portion 
of the 450 kb is covered by four HuFcef scaffolds, all with a moderate 
or high proportion of unmatched fc-mers (DS488544, 4.7 kb, 77% 
unmatched; DS487120, 10.4 kb, 66%; DS486616, 40.3 kb, 82%; 
DS486460, 269.5 kb, 74.2%). Thus, it is unlikely that another re- 
gion similar to this 450-kb euchromatic island is missing in the 
reference human Y sequence. This is relevant since only a small 
portion the 40-Mbp Yql2 band has been sequenced (Skaletsky 
et al. 2003; Jehan et al. 2007; Hughes and Rozen 2012). Also re- 
markable is the small total size of the scaffolds likely derived from 
these heterochromatic regions (34 scaffolds; 206 kb), confirming 
its extremely low sequence complexity. However, we cannot ex- 
clude the possible existence of segmental duplications with ex- 
treme sequence identity; such regions can be completely missed in 
the HuRef and other WGS assemblies and even in the finished 
human genome. For example, a segmental duplication originated 
four copies of the SRGAP2 genes (Dennis et al. 2012), which se- 
quence divergences range from 0.014% to 0.58%; a BLASTN search 



of the SRGAP2 mRNA (NM_015326) detected only one full copy and 
very small additional pieces in the HuRef assembly (not shown). 
Until recently, the whole region was also collapsed in a single copy 
in the reference human genome sequence. 

Identification of Y-linked sequences in complex genomes 
and segmental duplications 

The YGS method worked nearly perfectly in D. virilis and probably 
will do the same in other Drosophila-\ike genomes. In mammals (and 
possibly other complex genomes) there will be the issue of SD 
misassembly, a problem that affects many genomic analyses such as 
gene mapping, and polymorphism patterns (Bailey et al. 2001; 
Dennis et al. 2012). Detection of Y-linked scaffolds through female 
short-read sequencing is one more on this list. SD are much more 
common and longer in complex genomes (e.g., primates) than in 
Drosophila or in Caenorhabditis (Samonte and Eichler 2002; Fiston- 
Lavier et al. 2007) and are more frequently misassembled in WGS 
projects that use short reads, when compared to Sanger-based pro- 
jects (Alkanet al. 2011) such as HuRef. These factors should be taken 
into account while interpreting the distribution of matching (e.g., 
Fig. 3); and in most cases, one may want to consider as candidate to 
Y-linkage those scaffolds having a relevant proportion of un- 
matched fc-mers instead of only those close to 100%. The cutoff 
we chose for this "relevant proportion of unmatched /c-mers" 
(25% for small scaffolds) is conservative, so we certainly missed 
some Y-linked sequence. For the HuRef assembly, the missing 
amount is small: The really doubtful scaffolds are the small ones 
that are below the 25% cutoff but not too close to the autosomal/ 
X peak (e.g., <10 kb and with the proportion of unmatched /c-mers 
between 5% and 25%); there are 173 such scaffolds, amounting to 
740 kb (0.03% of the human genome). In other words, the vast 
majority of the sequence could be reliably identified as Y- or not 
Y-linked, even in an unfinished assembly of a complex genome. 

Other limitations of the YGS method and alternative 
approaches 

In the previous section we addressed the issue of segmental dupli- 
cations, which probably is the main limitation of the YGS method. 
Here we discuss other limitations as well as alternative approaches. 

Pseudoautosomal regions (PAR), when present, cannot be 
detected by the YGS method, because they are fully shared between 
the X and Y chromosomes. 

Several genome projects, including virtually all mammalian 
genome projects in recent years, sequenced only the homogametic 
sex to increase the coverage of the X (or Z) chromosome (e.g., 



Genome Research 1901 

www.genome.org 



Carvalho and Clark 



silkworm and rat) (Rat Genome Sequencing Project Consortium 
2004; Xia et al. 2004). In these cases, the study of the Y (or W) 
chromosomes would require a separate sequencing effort targeting 
the heterogametic sex. 

Assembly quality is expected to affect the detection of Y-linked 
scaffolds. Unassembled sequences obviously cannot be mapped, 
and chimeric scaffolds may produce ambiguous results if they 
misjoin Y-linked and autosomal (or X-linked) sequences, as possibly 
exemplified by the KB459848 scaffold of D. kikkawai (see Results). 
Sequencing errors create noise if they were not efficiently removed 
during the assembly, which may partially obscure the difference 
between Y-linked and not Y-linked small scaffolds; this problem is 
much ameliorated by k-mer validation (Supplemental Fig. SI 6). 
Contaminant scaffolds are spuriously detected as Y-linked if control 
male reads are not used (Supplemental Fig. S9). Assemblers differ in 
their performance, sometimes dramatically (e.g., Supplemental Fig. 
S14), but unfortunately it is difficult to obtain a reliable measure of 
assembly quality without a reference genome (Salzberg et al. 2012). 
Thus, as in all analyses of unfinished genomes, the YGS perfor- 
mance may be affected by assembly errors that are not easy to pre- 
dict in advance. 

Finally, it is worth considering alternative methods for the 
detection of Y-linked scaffolds. We already mentioned that the best 
strategy is to generate and sequence separate male and female li- 
braries (Krzywinski et al. 2004; Carvalho et al. 2009). Two recent 
studies showed that comparative genome hybridization can iden- 
tify X- and Y-linked scaffolds. Baker and Wilkinson (2010) labeled 
male and female DNA of stalk-eyed flies with different dyes and 
hybridized them to arrays of probes representing EST sequences. 
This method allowed identification of both X- and Y-linked ESTs, 
due to the copy number differences between males and females. 
He et al. (2012) used chromosomal deletions and a conceptually 
similar design to place the unmapped scaffolds of D. melanogaster 
into the heterochromatic regions of all chromosomes, including the 
Y. The identification of both X- and Y-linked scaffolds is an attractive 
feature of this approach. On the other hand, the requirements of 
sequence uniqueness are much higher for DNA hybridization 
than for the direct sequence comparison employed by YGS: He 
et al. (2012) could design probes for < 20% of the unmapped scaf- 
folds, whereas —99% of the D. virilis scaffolds could be reliably 
classified as Y-linked or not Y-linked (Fig. 3A). It will be very in- 
teresting to apply YGS to D. melanogaster and to directly compare 
the two mapping approaches. 

Methods 

Genomic sequences 

We used the CAF1.2 (Drosophila 12 Genomes Consortium 2007) and 
the CA assemblies of Drosophila virilis (see Comparison of the 
D. virilis Assemblies, below) and two versions of the human ge- 
nome, the finished version Build 37 (GRCh37.p9; International 
Human Genome Sequencing Consortium 2004) and the un- 
finished WGS assembly HuRef (Levy et al. 2007). We also used 
the D. kikkawai short-read assembly (S Richards, J Qu, H Jiang, 
M Batterton, K Blankenburg, S Gubbala, Y Han, J Jayaseelan, D Kalra, 
C Kovar, et al., unpubl.; accession number AFFH00000000.2). 

Short reads 

D. virilis Illumina female reads were produced at Macrogen in an 
Illumina HiSeq 2000 sequencer (one lane, 100-bp single-end; 22 



Gbp of raw sequence or ~85-fold coverage of the genome after 
filtering) from the same inbred strain used in the genome project 
(Drosophila 12 Genomes Consortium 2007). D. kikkawai females 
were similarly sequenced, except for the lower coverage (20-fold). 

To reduce the interference of polymorphism in the human 
data, we used the Illumina short reads from the available 36 females 
of Great Britain ancestry (The 1000 Genomes Project Consortium 
2010), two FASTQ files from each (listed in Supplemental Data File 
SI, Illumina_reads.txt), which yield ~38-fold coverage after filtering. 
Male reads were used to detect contaminant scaffolds (Supplemen- 
tal Fig. S9) and to remove sequencing errors in the HuRef assembly 
(below; Supplemental Fig. Sll), they came from 40 males of Great 
Britain ancestry (44-fold coverage after filtering) (The 1000 Ge- 
nomes Project Consortium 2010). 

We used a high coverage of female reads to test the limits of 
the method, but pilot experiments with Drosophila (0.6-fold cov- 
erage) (Carvalho and Clark 2008) and humans (fivefold coverage 
from a single female) achieved good results (Supplemental Fig. 
S12). Low coverage introduces significant binomial sampling error 
and hence reduces resolution for small scaffolds. It is difficult to 
provide a more precise guidance on the minimum female se- 
quencing coverage required to accurately detect Y-linked scaffolds 
because it depends on many factors such as genome properties 
(e.g., amount of segmental duplications), assembly quality (e.g., 
fragmentation), and the error rates in the genome and female short 
reads (e.g., Supplemental Fig. S14). 

We found that removal of sequencing errors in the female 
short reads is important in order to achieve a good resolution. This 
was done using two criteria: (1) masking low quality bases; and (2) 
removing /c-mers that are too rare in comparison to the genomic 
coverage, as detailed in Kelley et al. (2010). Specifically, for the 
D. virilis short reads, the sequencing errors were filtered by masking 
bases with phred score <20 and removing 15-mers that are present 
less than 5 times. For the human short reads, we used a less strin- 
gent filtering (phred score <10; frequency cut-off of 2), attempting 
to preserve rare variants. Although we initially did the quality fil- 
tering and frequency cutoff with our YGS.pl program, we found 
that the program Jellyfish (Marcais and Kingsford 201 1) can do both 
operations much faster, which is essential in the case of large data 
sets (e.g., human genome). Jellyfish can output in FASTA format all 
fc-mers found after filtering; this FASTA file then is read by our 
YGS.pl program to build the bit-array. 

Description of the YGS method and computer implementation 

The genome sequence of males and females differ only by the 

Y chromosome. This, and the availability of inexpensive sequencing 
suggested that the pieces of the Y chromosome in assembled ge- 
nomes can be identified by the lack of matches to female short reads 
(Carvalho and Clark 2008). While doing this genome vs. female 
comparison, identical repeats of all types (e.g., transposable ele- 
ments, segmental duplications) should be removed from the ge- 
nome sequence because they will cause spurious matches to female 
reads, but variants in these repeats (called "sequence family vari- 
ants" or "singly unique nucleotide") contain valuable information 
(Krsticevic et al. 2010; Dennis et al. 2012; Hughes and Rozen 2012) 
and should be preserved along with all single-copy sequences. 
Preserving repeat variants is particularly important because the 

Y chromosome is repeat-rich; many of its "single-copy" regions 
actually are rare variants of some repeat (e.g., decayed transposable 
elements). Despite much effort, we could not achieve this selective 
masking using BLAST (due to its built-in heuristics), and tools such 
as RepeatMasker are designed to broadly mask known repeats. A 
straight comparison based on short DNA words (fc-mers) achieved 
the goal of a genome vs. female comparison that uses repeat vari- 
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ants and removes all types of identical repeats. We implemented it 
using bit-arrays, which can compactly store the presence (coded as 
"1") or absence ("0") of each k-mei. For our purposes, each k-mer 
and its reverse complement are equivalent; we stored whichever 
comes first lexicographically. Two bit-arrays were initially built, 
one for the /c-mers present in the female short reads ("F") and one 
for /c-mers that occurred more than once in the assembled genome 
(repetitive /c-mers, stored in bit-array "R"). In the final run, a third 
bit-array was obtained for each scaffold of the assembled genome 
("G") and compared to the two previous ones using logical oper- 
ations: G NOT R yields a bit-array with single-copy /c-mers of the 
scaffold ("GSC"); GSC NOT F yields the desired comparison be- 
tween the assembled genome (one scaffold a time) and the female 
short reads. We then obtained the proportion of unmatched 
single-copy /c-mers for each scaffold. Optionally, as described in 
a section below, a fourth bit-array was built for removal of se- 
quencing errors in the assembled genomes. The /c-mer size should 
be large enough so that identical /c-mers seldom occur by chance 
in the assembled genome or are generated by sequencing errors in 
the female reads (Kelley et al. 2010; Li et al. 2010). On the other 
hand, run time and memory usage scale by a factor of 4 k , which 
is the number of elements of the bit-array. We used k = 15 for 
Drosophila, and/c = 18 for humans, which are typical values used for 
filtering errors in short reads from these genomes (Kelley et al. 
2010; Li et al. 2010). The main sign of insufficient /c-mer size is the 
displacement of the peak containing the Y-linked scaffolds toward 
the autosomal/X-linked peak, which reduces the resolution of the 
method. We observed this effect when we analyzed the human 
genome using k = 16 instead of k = 18 (Supplemental Fig. S13, 
panels A,B). On the other hand, little additional resolution was 
gained by increasing k from 15 to 17 with the Drosophila data 
(Supplemental Fig. S13, panels C,D). So we suggest as initial values 
k = 15 for insect-like genomes (although a higher value would not 
harm) and k = 18 for vertebrate-like genomes. Going beyond k = 18 
seems unnecessary and would require —160 Gb of RAM memory. 
The final run of the Drosophila data (using k = 15) required ~9 h 
and 4 Gb of memory in a 2.33 GHz Linux machine; the human 
data (using k = 18) required 15 d and 40 Gb. Since each scaffold is 
processed separately in the final run, faster results are easily ob- 
tained by using multiple machines. Computer code (Supplemental 
Data File S2 YGS.pl) was implemented in Perl making extensive use 
of the Bit:: Vector module (S Beyer, unpubl.; http://search.cpan.org/ 
dist/Bit- Vector/) and runs in a single processor. 

As commented previously, the program Jellyfish (Marfais and 
Kingsford 2011) was used to filter the short reads. The program 
YGS.pl (Supplemental Data File S2) then does the main steps of the 
YGS method: extraction of /c-mers from the filtered short reads, 
extraction of repetitive /c-mers from the genomic scaffolds, and the 
comparison between the genomic and short-read /c-mers. Each step 
requires a separate run. A sample script with the commands used 
in all steps of the YGS method is included in the Supplemental 
Material. 

Although primarily designed to detect Y-linked (or W-linked) 
scaffolds, YGS seems to be a useful tool for detection of contami- 
nant (Supplemental Fig. SI OA) and low-quality scaffolds (Supple- 
mental Fig. S14), and for assembly comparisons/detection of new 
sequences in the presence of large amounts of repetitive DNA 
(Supplemental Fig. S5). 

Detection of contaminant scaffolds 

Most assembled genomes contain, at least in the initial releases, 
a small amount of contaminant sequences such as bacteria and 
yeast (e.g., Alkan et al. 2011). The YGS method will pinpoint any 
sequence of the assembled genome not present in the female reads, 



and hence some of the Y-linked scaffolds may actually be con- 
taminants. We investigated this possibility in the HuRef assembly 
by a variant of the YGS method: By using male (instead of female) 
short reads, all legitimate sequences (and only them) will be 
matched, whereas contaminant sequences will remain largely 
unmatched. This procedure assumes that the contaminant is ab- 
sent from the male reads, which is reasonable (except in cases such 
as a widespread virus, but in this case, being present in females, it 
will not be detected as Y-linked). 

As shown in Supplemental Figure SI OA, this procedure worked 
very well: Five HuRef scaffolds stood out, and BLAST analysis 
showed that all are bacterial sequences (99% identity to a plas- 
mid of Streptomyces clavuligerus). They were removed from all 
subsequent analyses (the accession numbers are: DS487629, 
DS489113, DS490066, DS490212, and DS490583). 

The above analysis was done including all /c-mers (repetitive 
and single-copy). When we repeated it using only the single-copy 
/c-mers, we observed an increased proportion of unmatched /c-mers 
in legitimate scaffolds, particularly on the small ones (Supplemental 
Fig. S10B). This increase is expected because both sequencing er- 
rors and rare polymorphisms usually generate single-copy /c-mers. 
Irrespective of their origin, these unmatched /c-mers most likely 
will not match the female short reads, and if present in an auto- 
somal (or X-linked) scaffold, will displace it toward the region of 
Y-linked scaffolds. This suggested that male short reads could be 
used not only to detect contaminant scaffolds but also to remove 
sequencing errors (and rare polymorphisms) from the legitimate 
scaffolds in order to improve the separation between Y- and not 
Y-linked scaffolds (see next section). 

Because we did not have male short reads from D. virilis, we 
searched for contaminant scaffolds in both D. virilis assemblies by 
a BLASTN search against a large database of bacterial genomes plus 
Saccharomyces cerevisae (the latter is used to feed Drosophila and is 
a common contaminant). Contamination was irrelevant in the CA 
assembly (we found one 200-bp piece of a vector inside a 1.3-Mbp 
scaffold), which was used for most analyses reported in this paper. 
The CAF1.2 assembly contains 39 small contaminant scaffolds, 
and all have a relevant amount (13%-100%) of bacterial sequences 
(vector sequences in many cases). They are identified in the Sup- 
plemental Data File S10 virCAT12_scafs_classification.pdf and were 
excluded from all subsequent analyses. As expected, they behaved 
as Y-linked sequences when compared to female short reads (not 
shown). 

Removal of sequencing errors in the assembled genomes 

Sequencing errors in the assembled genomes (and rare poly- 
morphisms in the HuRef donor) are expected to increase the pro- 
portion of unmatched /c-mers, and hence displace autosomal/ 
X-linked scaffolds toward the Y peak. We initially did not expect it 
to be a problem because sequencing errors should be rare, even in 
unfinished genomes. Indeed, in both the D. virilis and HuRef as- 
semblies, when we exclude the Y-linked scaffolds, the genome- 
wide proportion of single-copy /c-mers unmatched by female reads 
is low (D. virilis CA assembly: 0.15%; HuRef: 0.26%). However, we 
also found that in small scaffolds this proportion is substantially 
higher: For the non- Y-linked scaffolds smaller than 5 kb, it is 6.5% 
and 2.2% (D. virilis CA and human HuRef assemblies, respectively) 
(see Supplemental Fig. S10B). This inverse association with size 
probably has two causes: (1) In small scaffolds a larger share of the 
sequence came from the edges, which tend to have lower coverage 
(and hence higher error rate); and (2) traces containing errors will 
cause assembly breaks and result in smaller scaffolds. Whatever the 
cause, we found that removal of these sequencing errors yield 
a cleaner result for small scaffolds. Given the data that are avail- 



Genome Research 1903 

www.genome.org 



Carvalho and Clark 



able, we did the sequence error removal differently in Drosophila 
and human genomes, but the principle is the same. 

In the human data, we had access to Illumina short reads of 
40 Great Britain males (The 1000 Genomes Project Consortium 
2010); fc-mers not represented there were deemed as sequencing 
errors in the assembled genome (or rare polymorphisms in the 
HuRef donor) and removed from the analysis. Namely, we built an 
additional bit-array storing the fc-mers from male short reads ("M"); 
the bit-array containing the validated single-copy (VSC) fc-mers of 
each scaffold is obtained by GSC AND M; then VSC NOT F yields 
the desired assembled genome vs. female short reads comparison. 
This procedure indeed yields a clearer separation between Y- and 
non-Y-linked scaffolds (Supplemental Fig. SI 1). It removed 0.2% of 
the single-copy /c-mers across the genome and 4.5% in the case of 
small scaffolds. 

When male reads are not available (as inD. virilis), sequencing 
errors in the assembled genome can be partially removed by 
identifying regions of low quality, e.g., by using the quality values 
from the consensus sequences or from the Sanger traces used to 
build them. We chose the last option because the consensus quality 
values of the D. virilis assemblies were not available. We first 
downloaded from the NCBI trace archive the same Sanger traces 
used to assemble the D. virilis sequence (Drosophila 12 Genomes 
Consortium 2007) and filtered them at a phred score of 20 (i.e., all 
bases with >1% error rate were masked). Then we used the filtered 
Sanger traces to validate the genomic fc-mers, analogously to the use 
of male reads in the human data (see above). In the D. virilis CA 
assembly, this validation procedure removed 0.2% of the single- 
copy fc-mers across the genome and 4.6% in the case of small scaf- 
folds (<5 kb). As a control, we repeated the whole procedure using 
unfiltered Sanger traces; as expected, these values drop to nearly 
zero (genome-wide: 0.03% ; small scaffolds: 0.2%). The effect of the 
validation in the D. virilis CA assembly was clear (Supplemental Fig. 
S15) but perhaps less dramatic than in the human genome (Sup- 
plemental Fig. SI 1); a major benefit of this analysis is that it allowed 
us to detect the problems of the CAF1.2 assembly (next section). 

Comparison of the D. virilis assemblies 

We used two assemblies, CAF1.2 (available at ftp://ftp.flybase.net/ 
genomes/Drosophila_virilis/dvir_r 1 . 2_FB20 1 2_0 1 / f asta/dvir-all- 
chromosome-rl.2.fasta.gz) and CA (http://goo.gl/qLrG0). The 
D. virilis CAF1.2 is the assembly used in the Drosophila 12 Genomes 
Consortium (2007) and in FlyBase (McQuilton et al. 2012). It is 
derived from two primary assemblies done with the Arachne and 
the Celera assemblers (Drosophila 12 Genomes Consortium 2007; 
Zimin et al. 2008). While doing the removal of sequencing errors 
(see previous section), we noticed that the CAF1.2 assembly has 
a very large number of low quality, small scaffolds. Namely, the 
CAF1.2 assembly has 13,530 scaffolds, and only 2261 have >80% 
valid single-copy fc-mers (Supplemental Fig. S14A); among these 
2261, only 1056 have >50 valid single-copy fc-mers. As detailed in 
the previous section, we did this comparison using Sanger traces 
filtered at a phred score of 20 (i.e., 1% error), so —11,300 scaffolds 
(i.e., 13,530 minus 2261) contain a substantial amount of low 
quality sequence. Nearly all low-quality scaffolds are small (aver- 
age size 1527 bp; two have —100 kb); they are very rich in simple 
repeats: 46% of their sequence is masked by DUST (Morgulis et al. 
2006), perhaps reflecting the huge amount of satellite DNA in the 
D. virilis genome (Bosco et al. 2007). Removal of low quality scaf- 
folds and those containing less than 50 single-copy fc-mers results 
in an assembly with 1056 scaffolds and 1 70 Mbp (90% of the total 
sequence, 189 Mbp) and yields a clean separation between Y and 
not Y- linked scaffolds (Supplemental Fig. S16B). Supplemental 
Data File S10 virCAF12_scafs_classification.pdf lists all 13,530 



CAF1.2 scaffolds along with their classification (Y-linked, not 
Y-linked, intermediate, contaminant, low-quality, and small). 

Although the quality filtering solved at least part of the 
CAF1.2 problems, it is undesirable to rely only on such data, so we 
searched for alternative assemblies. Unfortunately, the original 
Arachne and Celera assemblies (Drosophila 12 Genomes Consor- 
tium 2007; Zimin et al. 2008) are no longer available (G Sutton, 
pers. comm.), but the assembly evaluation team of the Drosophila 
12 Genomes Consortium produced in 2005 different assemblies of 
D. virilis (using different assemblers) based on exactly the same 
Sanger traces. We examined two of them, "CA" (Celera Assembler) 
and "AR" (Arachne), using the same procedures we did with the 
CAF1.2 assembly. We found that the AR assembly was similar to 
CAF1.2, containing a large number of small, low quality scaffolds 
(not shown), which are absent from the CA assembly (Supple- 
mental Fig. S14, panels C,D). Unless otherwise noted, we used the 
CA assembly in our analyses to avoid the use of filtered data. It 
is similar to the filtered CAF1.2 assembly both in number of scaf- 
folds (1186) and in size (165 Mbp; all reported assembly sizes ex- 
clude gaps). The main results (namely, the identification of four 
new Y-linked genes and 1 1 pseudogenes) were the same with both 
assemblies, and all tested candidates are represented in the CAF1.2 
assembly (Supplemental Table SI). 

Sequence finishing and nomenclature of the new D. virilis 
Y-linked genes 

Due to the low coverage of the Y chromosome (Carvalho et al. 
2003) and its abundance of repetitive sequences, the sequences of 
almost all Y-linked genes have gaps and sequencing errors, and 
different exons of the same gene are scattered in several scaffolds 
(Carvalho et al. 2000; Koerich et al. 2008). We corrected these 
problems by directly sequencing the products from polymerase 
chain reactions with reverse transcription (RT-PCR) in the four 
new D. virilis Y-linked genes and in the majority of their Y-linked 
orthologs in other species. Their finished sequences were sub- 
mitted to GenBank (see Data Access). 

Given their fragmentary assembly, the Y-linked genes usually 
are incompletely annotated (or not annotated at all) during the 
automatic annotation step of genome projects (Carvalho et al. 
2000; Koerich et al. 2008), and this also happened in D. virilis. 
When available, we maintained the original names even when the 
original annotation covers a small portion of the gene. Namely, for 
the four D. virilis genes, the original annotations were as follows: 
G] 19835 was missing >50% of the gene (at the N terminus); G] 19633 
was complete; GJ1 1126 was missing one-third of the sequence (at the 
N terminus); and the D. virilis ortholog of the JYalpha was missing one- 
third of the sequence and split in two genes (GJ18574 and GJ18410); 
we used the name GJ18574 because it covers a larger region. 

Detection and validation of new human Y-linked sequences 

119 HuRef scaffolds were identified as Y-linked due to a low match 
to the female short reads (Supplemental Table S4). A sequence 
comparison is needed to distinguish among the 119 scaffolds those 
scaffolds covered by the reference Y from those containing new 
sequences. This comparison should take into account that as an 
unfinished sequence, HuRef scaffolds contain gaps, collapsed re- 
peats, and possibly large-scale misassemblies, such as inversion of 
contigs within scaffolds, so we do not expect perfect matching 
with the reference Y; these discrepancies should not be con- 
founded to "new sequence." We initially tried Nucmer (Delcher 
et al. 2002) to compare the 119 scaffolds to the reference Y, but the 
very high repeat content of the Y-linked sequences created prob- 
lems. We also considered using RepeatMasker, but as we suspected 
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and later confirmed, it would broadly mask most new Y se- 
quence (70% of the 283 kb we found). We searched potentially new 
Y-linked sequences using two complementary approaches de- 
scribed below. As described in Results, these approaches were 
sensitive enough to detect even an 1 1-kb piece of new sequence in 
the middle of the 2.3-Mbp scaffold DS48627 7. 

First, we ran the YGS.pl program on the 119 Y-linked scaf- 
folds, using the fc-mers extracted from the reference Y sequence 
(accession NC_000024.9) instead of fc-mers from female short 
reads. This procedure should distinguish Y-linked HuRef scaffolds 
covered by the reference Y from those containing new sequences 
(the latter having a high proportion of unmatched fc-mers) and is 
insensitive to the more trivial discrepancies mentioned above. As 
detailed in Results (see also Supplemental Fig. S5; Supplemental 
Table S4, column 5), the majority of the scaffolds (85 of 1 19) and of 
the sequence (18.165 Mbp of 18.371 Mbp) are entirely contained 
(or nearly so) within the reference Y sequence. Thirty-four small 
scaffolds are mostly composed of new sequence (>50% unmatched 
single-copy fc-mers) and likely came from the Yql2 and other 
heterochromatic regions of the Y chromosome, which are not in- 
cluded into the reference Y sequence (Results). A few of the 85 
Y-linked scaffolds that closely match the reference Y have a moder- 
ate amount of unmatched /c-mers (—10%) (Supplemental Fig. S5), 
which suggests that they too may contain a small amount of new 
sequence. The four largest are DS486171, DS486512, DS486519, 
and DS486628. The largest (DS486171) contains the misassembled 
XTR region; further analysis (below) shows that all its unmatched 
regions correspond to patches of the misassembled X-chromosome 
sequence, so it does not contain new sequence. The three next in 
size (DS486512, DS486519, DS486628) were shown to contain 
new sequence (below). 

The preceding approach may not detect a rather large amount 
of new sequence if it is inside a large scaffold, nor does it show 
where the new sequence is located within the scaffolds. The second 
approach addressed these issues. We used the BWA program (Li and 
Durbin 2009) to align the short reads from 40 Great Britain males 
(44-fold coverage of the genome) (The 1000 Genomes Project 
Consortium 2010) to the HuRef assembly under very high strin- 
gency (allowing at most one mismatch or one indel), and using 
BamTools (Barnett et al. 2011), extracted from the BAM file the 
reads that aligned to the 119 Y-linked scaffolds. We then aligned 
these reads to the reference Y sequence (using the same BWA pa- 
rameters), and now extracted the reads that did not align. These 
"differential male reads" represent the regions of the 119 Y-linked 
scaffolds not covered by the reference Y and are present in other 
males (i.e., potentially new regions of the Y chromosome). In order 
to locate them, we aligned the differential male reads to the 119 
scaffolds and used SAMtools (Li et al. 2009) to count the number of 
reads per scaffold (Supplemental Table S4, column 6). Finally, the 
HuRef assembly was aligned to female reads to confirm the Y- linkage 
of the 119 scaffolds. Scaffolds with a high number of hits of dif- 
ferential male reads possibly contain new sequence; we con- 
firmed this using the Integrative Genomics Viewer (IGV) browser 
(Thorvaldsdottir et al. 2012) to visualize the location and coverage 
of the three types of reads (male, differential male, female) within 
each of the 1 19 scaffolds (e.g., Fig. 6). 

We found that all 34 HuRef scaffolds that were deemed before 
as "new sequence" (Supplemental Fig. S5) also have a large num- 
ber of hits from differential male reads (Supplemental Table S4, 
column 6), which confirms that they contain new Y-linked se- 
quence. Among the 85 scaffolds that closely match the reference Y, 
we found three patterns of hits of differential male reads: ( 1) scaffolds 
that have very few or zero hits from the differential male reads; these 
scaffolds are completely subsumed into the reference Y; (2) scaffolds 
that contain the misassembled XTR region, characterized by a 



large number of hits from the differential male reads, unevenly 
distributed (DS486171, DS486351, etc.) (Supplemental Table S4); 
they also have many hits from female reads in the same region of 
differential male reads, so we dismissed them; and (3) scaffolds 
that contain a discrete region with a large number of hits from the 
differential male reads, evenly distributed and devoid of significant 
coverage by female reads (e.g., Fig. 6). The latter pattern provides 
unequivocal evidence of new Y-linked sequences; we found it in five 
regions, belonging to four scaffolds. Three of them were already 
detected with the previous approach (DS486628, DS486519, and 
DS486512); the remaining two regions are part of a large scaffold 
(DS486277), so their presence was inconspicuous in Supplemental 
Figure S5. Four additional scaffolds contain small patches (< 2 kbp) 
of new sequence (DS486286, DS486428, DS486288, and DS487109) 
and were not further studied. 

The above procedures control for all reasonable sources of 
artifacts. Two different approaches (Supplemental Table S4, col- 
umns 5,6) yield similar results (namely, both identified the 34 
small scaffolds, plus scaffolds DS486512, DS486519, and DS486628 
as containing new Y sequences). The k-mei comparison (Supple- 
mental Table S4, column 5) is insensitive to large scale mis- 
assemblies (contig inversions, etc.), which are common in repetitive 
regions. The alignment of male differential reads (Supplemental 
Table S4, column 6) is insensitive to contaminant DNA, plain se- 
quence errors in HuRef (which might mimic new sequence), and 
also to rare alleles/new mutations present in the HuRef donor be- 
cause the male reads would not match them. The visualization of 
the alignment of the three types of short reads (male, differential 
male, and female) (Fig. 6) allowed the separation of differences be- 
tween HuRef and the reference Y due to new sequence from artifacts 
caused by misassemblies (the later have hits against female reads 
in the same region of the differential male reads).The five newly 
found regions that extend the reference Y scaffolds are summarized 
in Supplemental Table S3. 

Data access 

The new nucleotide sequences reported in this manuscript have 
been submitted to GenBank (http://www.ncbi.nlm.nih.gov/ 
genbank/) as Third Party Annotation sequences under the ac- 
cession numbers TPA: BK008736-BK008744. 
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